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Abstract 

With systems for acquiring 3D surface data being ever- 
more commonplace, it has become important to reliably 
extract specific shapes from the acquired data. In the pres- 
ence of noise and occlusions, this can be done through the 
use of statistical shape models, which are learned from 
databases of clean examples of the shape in question. In 
this paper, we analyze and compare two types of such 
models: global models and local models. We first review 
how both types of models have been used in the litera- 
ture, then proceed to define the models and analyze them 
theoretically, in terms of both their statistical and compu- 
tational aspects. We then perform extensive experimental 
comparison to assess which type of model is better for a 
given task. Due to the wide availability of databases of 
high-quality data, we use the human face as the specific 
shape we wish to extract from corrupted data. 

1 Introduction 

Whether through laser range scanners, stereo reconstruc- 
tion or structured light, methods and systems for 3D sens- 
ing and acquisition are now commonplace. However, 
these systems typically incur some amount of noise. Fur- 
ther, if we are interested in a particular object within data, 
it may be occluded by other objects. To extract the shape 
of an object of a particular class from such data it is often 
advantageous to use a statistical shape model to ensure 
that the extracted shape is valid for that object class. In 
this paper, we describe, analyze and compare global and 
local variants of such a model. 

Such a statistical model must be learned from a 
database of instances from the object class in question. 
One class of shape for which there exist a number of avail- 
able databases, is the shape of the human face, and this is 
the shape which we use for evaluation in this paper. How- 
ever, we emphasize that the principles, models and algo- 
rithms discussed in this paper are applicable to any class 
of shapes for which a parameterized database is available. 
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The main reason to use a statistical shape model to fit to 
data, instead of a template fitting with a non-rigid iterative 
closest point (ICP) approach and some kind of regulariza- 
tion constraint, is the reduction of the search space, which 
results in the ability to reconstruct the underlying shape in 
the presence of severe noise or occlusions. These types of 
ambiguities are often present in real-world data captured 
in uncontrolled environments. 

The purpose of this paper is to provide researchers and 
engineers with an analysis and comparison of the costs 
and benefits of global and local statistical models. Toward 
this end we make the following contributions: 

• An analysis of the theoretical properties of both 
global and local statistical shape models. 

• A quantitative and qualitative analysis, and extensive 
comparative evaluation of the practical performance 
of both global and local statistical shape models. 

2 Related Work 

This section reviews work related to performing statistical 
shape analysis for image processing applications. These 
applications include the analysis of two-dimensional im- 
ages, videos, and three-dimensional images given either 
as volumetric images, such as magnetic resonance images, 
as three-dimensional point clouds, or as two-dimensional 
range images, such as images captured using depth cam- 
eras. 

The first step to performing shape analysis is to acquire 
and register a set of training shapes that capture the shape 
variability that is of interest for a particular application. 
Subsequently, statistical shape analysis is performed on 
the registered set of training shapes to obtain a prior distri- 
bution for the shapes of interest. Without correspondence 
information, this statistical analysis is not possible. 

Computing correspondences between a population of 
shapes is a challenging problem, and a detailed discus- 
sion about possible approaches is beyond the scope of this 
work. We refer the reader to a recent survey for more in- 
formation [38]. 

In computer vision, statistical 3D shape models are 
commonly used to infer the three-dimensional shape of 



an object from images, mostly for the purpose of image 
manipulation. While recently, different classes of shapes 
have been considered ifTOll , shape models of human face 
shapes and human body shapes are of special interest. In 
medical image analysis, statistical shape models are com- 
monly used to segment medical images and to find cor- 
respondences and abnormalities of anatomical shapes, re- 
spectively. 

2.1 Human Face Shapes 

We start our review by summarizing the use of statistical 
shape models of human faces. Blanz and Vetter (6) pro- 
posed the first statistical shape model for human faces. 
The model, called morphable model, captures both 3D 
shape and texture information and can be used to predict a 
3D face shape from a single input image. A parameterized 
database of 3D face scans, mostly in neutral expression, 
is used to learn the statistical model using standard prin- 
cipal component analysis (PC A) (see Section [4]). Given 
an input image in neutral expression, the learned shape 
space is searched to find the textured shape that best ex- 
plains the input image using an optimization technique. 
This successful approach resulted in multiple follow-up 
works 1211301. 

Brunton et al. |9] use a statistical analysis based on 
wavelet models to learn a localized prior distribution for 
the 3D shape of a human face (see Section [5]). This al- 
lows to capture and combine localized shape variations in 
different areas of the face independently. They use this 
information to predict the 3D face shape from two stereo 
photographs. As in the work by Blanz and Vetter, all faces 
are assumed to have a neutral facial expression. 

To allow for varying facial expressions, Vlasic et 
al. |39l use a statistical method based on tensor algebra 
called multilinear model to analyze a set of faces cap- 
tured of subjects performing a variety of facial expres- 
sions. This allows for the prediction of a 3D face shape 
from a single photograph. 

Yang et al. l42l exchange the expression of a face in a 
single image based on a different input image of the same 
subject. For this application, they build multiple PC A 
spaces (one per expression) and combine these spaces for 
their application. 

More recently, much work has focused on extracting 
a set of frames of three-dimensional face shapes from a 
video stream showing a face. The output of this type 
of algorithm is a four-dimensional sequence showing the 
three-dimensional face shape in motion. Dale et al. fT5l 
extend the method by Vlasic et al. to compute such a four- 
dimensional sequence. This information is then used to 



exchange faces in video sequences. 

Another avenue of recent work is to track two- 
dimensional range images over time. These images can 
be captured using depth sensors, such as the Kinect sen- 
sor. Weise et al. l40l propose a method to track a three- 
dimensional face model over time using prior information 
on the deformation model, and to use the tracked model 
to drive the animation of a virtual character in real-time. 
Unlike the other methods discussed in this section, this 
method learns a statistical prior that is subject- specific. In 
this paper, we do not consider subject- specific priors. 

2.2 Human Body Shapes 

Allen et al. Q] proposed a statistical model for human 
body shapes in a standard posture that is similar to the 
morphable face model introduced by Blanz and Vetter. 
One main difference is that Allen et al. only learn in- 
formation about the 3D shape and not about texture. This 
model has been used to predict a 3D human body shape in 
a standard posture from one or more images l35l[TTl [7l. 

To allow for posture variation, Anguelov et al. l3ll 
propose the SCAPE model. This model learns a PCA 
shape space for body shape variations using a database 
containing multiple subjects in a standard posture. Fur- 
thermore, the model learns a mapping from posture pa- 
rameters (based on a skeleton) to shape changes using a 
database containing one subject in multiple poses. The 
model then combines the two variations using the assump- 
tion that body shape and posture are decorrelated. Since 
the SCAPE model successfully models human body shape 
and posture, it has been used to predict a 3D body shape in 
arbitrary posture from a single image [19]. Furthermore, 
this model can be used to predict a 3D human body shape 
in arbitrary posture based on a set of input images of a 
dressed person [ 4 ]. Such a 3D prediction can then be used 
to modify the input image |45l . Just like in the case of 
human faces, more recently, much work has focused on 
finding a four-dimensional sequence of three-dimensional 
human body shapes in motion from a video sequence. Jain 
et al. 1 23 1 use this to modify human body shapes in video 
sequences. Weiss et al. |41 ] propose to use the SCAPE 
model to compute a 3D body scan from noisy Kinect data. 

A different avenue to allow for posture variation is to 
model shape and posture changes as correlated. This as- 
sumption is relevant, since the difference of the human 
body shape of the same subject in different postures de- 
pends on the body shape, e.g. on how muscular the sub- 
ject is. Hasler et al. [ 22 ] propose a shape space that jointly 
captures shape and posture variations by performing PCA 
on a rotation-invariant encoding of the shapes. This shape 
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space is then used to predict the body shape of a dressed 
subject l2T1l . 

An alternative for correlating human shape and posture 
variations is to use a multilinear model (as Vlasic et al. do 
for face shapes). This avenue was explored by Hasler et 
al. l20l for the application of predicting 3D body shape 
and posture from an image. 

2.3 Human Organ Shapes 

In medical imaging, one is especially interested in a body 
part, such as an organ or part thereof. Tasks of interest 
include finding the shape of interest in a medical image. 
This decomposition of the image is often called segmen- 
tation. To solve this task, Cootes et al. lfT3l propose the 
use of a statistical prior called active shape model. The 
active shape model learns the distribution of a set of reg- 
istered and aligned training shapes using PCA, and uses 
this prior information to segment a given medical image. 
This model is commonly used for image segmentation, 
see Cootes and Taylor lfT2l and references therein. 

One problem with active shape models is that they cap- 
ture global shape variations. In medical imaging, one is 
often interested in detecting localized shape anomalies, as 
these can give insights in whether or not a specific organ 
is affected by a disease, for instance. In an active shape 
model, such local variations may be distributed over sev- 
eral principal components, and they may be controlled by 
principal components that capture a small percentage of 
the overall shape variability. 

To remedy this, Davatzikos et al. lfl~6l used statistical 
analysis based on wavelet models to learn a localized prior 
distribution of contours in images. Nain et al. l28l ex- 
tended this technique to use wavelets to perform a statis- 
tical analysis of three-dimensional shapes. Shape priors 
based on different types of wavelet models have been used 
to segment medical images (29l EH IT8l . Yu et al. O 
show that statistical wavelet models can be used to ana- 
lyze cortical folding patterns, which is a challenging task. 

In the following, we provide a comparative evaluation 
of a global shape model based on principal component 
analysis and a local shape model based on a spherical 
wavelet representation. Our comparison uses as applica- 
tion a model fitting approach for human faces. 

3 Statistical Shape Analysis for 
Model Fitting 

In this section we give an overview of the generic model 
fitting approach. This assumes that a statistical shape 



model has been learned, for instance using the approaches 
given in Sections]?] and [5] 

We represent a shape with a set of shape parameters si 
for i = 1, . . . , d, which form a vector s G R d . A generator 
function 

F(s) : R d -> R 3n (1) 

generates from these shape parameters a surface mesh of 
n vertices. These shape parameters, and by extension the 
surface, are fit to an input point cloud P of m points. We 
do not assume any organization of the point cloud. 

3.1 Training Data and Parameterization 

For training, we use the neutral expressions of T = 100 
subjects from the BU-3DFE database |43|. This database 
contains relatively clean surfaces without occlusions, and 
a typical cropped face contains about 7500 vertices. Fur- 
thermore, each cropped face is equipped with 83 land- 
mark points. 

We parameterize the database using the method of 
Salazar et al. |32] that deforms a template to each in- 
put face. This method is capable of predicting landmark 
points to aid in the template fitting. However, since we 
are given ground truth landmarks, our algorithm uses the 
true landmarks instead of predicted ones. This removes 
a potential source of error during registration. The tem- 
plate we use contains 2816 vertices. We choose this 
low-resolution template for parameterization since the 
database has low resolution and does not contain small 
shape details. While the BU-3DFE database contains six 
additional expressions in four different levels, we consider 
only neutral expressions in this paper. 

Once the database is parameterized, we need to align 
the data to remove shape variations due to misalignment. 
We pre-align the data to remove rotation, translation, and 
uniform scale differences using generalized Procrustes 
analysis ifTTl . Note that by removing uniform scale dif- 
ferences, we only consider shape differences and not size 
differences of the models. The generalized Procrustes al- 
gorithm iteratively aligns each model to the mean shape 
and recomputes the mean. Note that removing transfor- 
mations that are not of interest is an important preprocess- 
ing step that yields better statistical models. 

3.2 Test Data 

In general, the models presented in Sections [4] and [5] can 
be fit to any 3D point cloud of a face. In this paper, we 
evaluate using a subset of the Bosphorus database l33l . 
This database contains clean faces as well as faces with 
different forms of occlusions, such as glasses, hand over 
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Figure 1: Bosphorus model with landmarks. Red land- 
marks are used for initial alignment, green landmarks are 
used for error evaluation. 



the eye or mouth, facial hair, or hair in front of the face. 
The resolution of this database is fairly high, and a typi- 
cal face contains about 35000 vertices. Furthermore, each 
face is annotated with up to 22 landmarks. Figure[T] shows 
a model of the Bosphorus database with 22 landmark posi- 
tions. As discussed in the following, the landmarks shown 
in red are used to compute an initial alignment of the test 
face to the learned shape space and the landmarks shown 
in green are used for error evaluation. Not all of the land- 
marks may be present in the database for two reasons: 
first, landmarks may be missing due to occlusion, and 
second, some landmarks are placed erroneously and we 
manually removed these landmarks. 

3.3 Initial Alignment 

To fit a statistical shape model to an input dataset, we first 
need to align the input data and the statistical shape model 
to be in the same global coordinate system. Since we con- 
sider only shape differences in the training data, the initial 
alignment aims to find the rotation, translation, and uni- 
form scaling that best aligns the statistical shape model 
with the input data. 

To compute such an initial alignment, corresponding 
landmarks are commonly used. These landmarks can be 
manually located on the mean shape of the aligned train- 
ing database once. On the input data, the landmarks can 
be predicted in a fully automatic way using the method 
of Creusot et al. fl4l . However, since in this paper, we 
use a test database that contains a set of landmarks, we 
choose to use a subset of these landmarks (the ones shown 
in red in Figure [T]) to compute an initial alignment. This 
approach removes a potential source of fitting error due to 
landmark prediction inaccuracies. 

3.4 Energy Minimization in Shape Space 

Our goal is to fit the statistical shape model to the input 
data as closely as possible while staying in the learned 



shape space. To fit our model to data, we minimize an en- 
ergy function that amounts to the sum of squared distances 
between each model vertex and its nearest neighbor in the 
input point cloud. Specifically, in this paper we use the 
following energy to pull the model towards the data 
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where is vertex i of F(s) (see Equation [I]), p G Pis 
a point in the input point cloud, NN(z) returns the index 
of the nearest neighbor in P of f^, and r is a truncation 
threshold to add robustness against outliers. We compute 
nearest neighbors with a k-d tree using the implementa- 
tion in ANN [3 

The training data is used to compute a statistical shape 
space. A commonly used assumption is to model the prior 
distribution as a multidimensional Gaussian distribution. 
The energy E^ ata is then often paired with an energy that 
is proportional to the logarithm of said prior probability 
distribution computed during training. Such a statistical 
prior provides strong regularization. In fact, it is often too 
strong, pulling the fitted model towards the mean shape 
and away from the input data, resulting in a lack of detail 
and distinctiveness in the fitted shape. 

Patel and Smith |[3T1 proposed an alternative prior that 
is aimed at maintaining the distinctiveness of the mod- 
els. They model the shape space as a manifold that is 
at a constant Mahalanobis distance from the mean. This 
is based on the observation that the squared Mahalanobis 
distances from the mean of a set of d-dimensional nor- 
mally distributed vectors form a Xd distribution with ex- 
pected value equal to d. Hence, Patel and Smith restrict 
the shapes to be on the hyper-ellipsoid at Mahalanobis dis- 
tance \[d from the mean in order to preserve shape dis- 
tinctiveness. While this approach models distinctiveness 
using the expected Mahalanobis distance from the mean, 
it does not consider the normal distributions along each 
dimension of the shape space. That is, the modeled shape 
space contains highly unlikely shapes along the directions 
of the principal components, as can be seen for the shape 
at the intersection of the hyper-ellipsoid shown in red and 
the x-axis in Figure [2] 

When fitting a statistical shape model to data, the 
space of possible solutions should only contain likely 
shapes. This ensures that only plausible results are pos- 
sible. Hence, in this paper we instead choose to con- 
strain the shape to lie within the hyper-box of ±cc^ about 
the mean shape, where is the standard deviation of 
the training data along dimension i of the learned statis- 
tical space, and c is a parameter controlling the amount 



1 http://www.cs.umd.edu/ mount/ ANN/ 
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Figure 2: Unrealistic shapes occur along the directions of 
the principal components when maintaining a fixed Ma- 
halanobis distance from the mean shape. Here, a global 
statistical shape space with d = 50 and c = 1 is shown. 

of deviation allowed. Figure [2] shows a two-dimensional 
plot of this hyper-box. We demonstrate that in this shape 
space, by minimizing an energy function with only the 
data term given in Equation [2] we can maintain distinc- 
tiveness, while avoiding unrealistic shapes. This is equiv- 
alent to a prior probability of the form 



P(s) = l[P i ( Si ) 



where 



Pi(*i) 



1 \Si\< COi 

otherwise 



(3) 



(4) 



if we assume the shape parameters si are centered (mean 
subtracted). We call this a hyper-box prior. 



4 Global Statistical Shape Analysis 

We first describe a commonly used statistical shape space 
that captures the global shape variability of a set of train- 
ing shapes. 



4.1 PCA 

Principal Component Analysis is a method which aims to 
reduce the complexity of a set of data. Due to its simplic- 
ity it is widely used for shape analysis. PCA is a linear 

with 



p3n 



to 



transformation of a set of vectors from 
d < 3n. A vector f e R 3n is expressed by the scalar 
weights Si in a d-dimensional subspace, spanned by the 



orthogonal vectors V;, by 



F(s) : f = F 



where F = ^J2t=i F- tram ^ is the mean of the training 
data. For each parameterized shape of the training set we 
have one vector jf^ram) ^ ^ 3n contains an ordered 
coordinate set of all points of the i-th training shape. The 
vectors V 2 - are the eigenvectors of the data covariance ma- 
trix 



i=l 



(5) 
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The eigenvectors V$ are ordered with respect to the non- 
increasing corresponding eigenvalues A^. The eigenvalues 
Xi measure the variability captured by the i-th principal 



component. More specifically, V$ captures 100 ^ 



% 



of the variability of the training data. The rank of the 
data covariance matrix is at most min(3n — 1,T — 1) 
and therefore the number of distinct non-zero eigenval- 
ues and hence, the number or principal components, is at 
most min(3n — 1, T — 1). 

4.2 Evaluation of Statistical Model 

If a statistical model with a small number of principal 
components is fitted to a face, the result contains little 
shape detail, because the model only represents a small 
proportion of the variability of the training data. For a 
large number of components the model tends to overfit 
the data. That is, the learned shape space may contain 
unrealistic face shapes. 

To pick a number of principal components d that pre- 
serves a high amount of variability yet does not overfit 
the training data, we evaluate the global PCA model us- 
ing the following three error measures similar to compact- 
ness, generalization, and specificity (36). We use a slight 
modification of the original error measures to obtain re- 
sults that are independent of the size of the training data. 

Compactness measures how much variability of the 
training data is explained by the learned statistical model. 
For d principal components, compactness is defined as 



T-l 



(7) 



2=1 



2 = 1 



where is the i-th eigenvalue of the data covariance ma- 
trix. The first row of Figure [3] shows the compactness plot 
for our training data. 
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Figure 3: Compactness, generalization and specificity. 



Generalization measures the ability of the model to rep- 
resent data, which are not part of the training set. To 
calculate this measure, we learn a PCA model on a sub- 
set of the training data, where one subject is excluded. 
The excluded subject is projected to the PCA space, re- 
constructed, and the distance between the source and the 
reconstruction is measured. To measure the distance be- 
tween two faces, we use the average Euclidean vertex dis- 
tance computed between all corresponding vertices. We 
perform this measurement for all subjects. The average 
and standard deviation of the distances over all subjects is 
shown in the second row of Figure [3] 

Specificity measures the similarity between reconstruc- 
tions from the statistical model and the training data. This 
estimates the plausibility of a random face represented us- 
ing the learned shape space. To calculate specificity we 
choose a set of random points sampled from the distribu- 
tion of the learned PCA space. For each of these points 
we compute the reconstruction using the PCA model and 
the distance to the closest face in the training data. The 
distance between two faces is computed as above. In the 
last row of Figure [3] 10000 random samples are chosen 
and the mean and standard deviation over all samples is 
shown. 

Our shape space (evaluated in Figure [3} should ideally 
be compact, general, and specific. The compactness plot 
shows that 50 principal components explain more than 
98% of the data variability. Furthermore, for more than 
50 components, the generalization error only decreases 
slightly, which implies that the benefit of choosing more 
components is small. Finally, the specificity error still 
increases for more than 50 components, which means 
the model represents plausible faces. Hence, we choose 
d = 50 for the global fitting method. This choice for the 
number of principal components is also the reason for us- 
ing a 50-dimensional shape space in Figure [2] 

4.3 Properties 

The global shape space represents the high-dimensional 
differences of the training faces in a low-dimensional 
shape space that is spanned by the corresponding eigen- 
vectors of the d largest eigenvalues of the data covariance 
matrix. 

Figure [4] shows the variations along the first two prin- 
cipal components in the range of — to -f c^, where 
denotes the standard deviation of the i-th principal com- 
ponent. 

The amount of details that can be expressed by the 
global statistical model is limited by the details present 
in the training data. It would therefore be useful in some 
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Figure 4: Variations of the first two principal components. 

applications to increase the variability of the training data 
by increasing the mesh resolution of the training data by 
inserting new vertices as linear combinations of existing 
vertices. Unfortunately, this is not possible using this 
global approach. If a vertex expressed as a fixed linear 
combination of existing vertices is inserted to each train- 
ing mesh, the corresponding additional vertex in the fitted 
surface is identical to the corresponding fixed linear com- 
bination. Therefore, if an additional vertex is chosen to be 
placed on a triangle, the corresponding additional vertex 
is located on the corresponding triangle of the fitted re- 
sult. Hence, using fixed linear combinations to add points 
to the surface of the model and fitting this extended sur- 
face to a target face leads to the same result as fitting the 
original model to the target face and adding the points into 
the resulting surface using the fixed linear combinations. 
This is a key difference to the local model reviewed in 
Section |3 

4.4 Energy Minimization 

While the energy in ^ is not strictly differentiate at all 
points, it is continuous, and the number of points where it 
is not differentiate is small. Hence, we can minimize it 
using a bounded Quasi-Newton method (26). The coordi- 
nate bounds on the parameters enforce the condition given 
in ([?]). This minimizer gives quadratic convergence rates 
without the need for an explicit inverse Hessian. Note that 



this optimization technique does not guarantee to find the 
global optimum of the energy function. 

4.5 Computational Complexity 

Let to denote the number of iterations required for the 
minimization to converge, and let m denote the number of 
data points in the target shape. The complexity of build- 
ing a k-d tree of m points in 3D is O(ralogra), and a 
single nearest neighbor search takes 0(m 2 / 3 ) time l24l . 
Given the nearest neighbor indices for all n points, a sin- 
gle evaluation of the energy given in Equation ^ takes 
0(n) time, and a single evaluation of its gradient takes 
0(nd) time. Thus, the overall time complexity of fit- 
ting the global PCA model to a dataset is 0(m log m + 
tGn(d-\-m 2 / 3 )). For this model, each training shape con- 
tains n = 2816 vertices. 

5 Local Statistical Shape Analysis 

We now describe an alternative statistical shape model in 
which a local basis is used to train many localized shape 
priors that describe the shape variation at different loca- 
tions and scales over the surface. 

5.1 Local Bases and Wavelet Transforms 

A wavelet transform decomposes sampled data by pro- 
jecting it onto a set of basis functions that are localized in 
space and frequency. 

Wavelet transforms were originally defined on reg- 
ularly sampled Euclidean domains l27l . Second- 
generation or lifting wavelets l37l are computed in time 
linear in the number of samples in the original signal using 
local lifting operations and sub- sampling at each scale. 
Specifically, the data are partitioned into maximally cor- 
related subsets, A and B. For a ID signal this would be 
even and odd samples. Then, a prediction operator V uses 
A to predict B. This prediction is subtracted from the 
actual values of B to give residuals or detail coefficients 
V = {d : d = b- Vb(A) V b e B}. An update oper- 
ator U then uses the detail coefficients to update A yield- 
ing approximation coefficients. The whole process is re- 
peated on the approximation coefficients. The prediction 
and update operators are collectively called lifting opera- 
tors, and typically have local support, which makes them 
fast to compute. The sub- sampling ensures that the overall 
complexity remains linear in the number of samples. 

Spherical wavelets l34l are defined on subdivision 
surfaces, typically topological spheres. The particu- 
lar wavelet basis we use is a biorthogonal generalized 
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B- spline basis that uses the Catmull-Clark subdivision 
scheme Q. The prediction and update operators are B- 
spline interpolations from the neighboring vertices. This 
scheme is stable for linear and cubic B -splines; in this pa- 
per we use the linear basis as we found it produced less os- 
cillatory surface artifacts in relatively flat surface regions. 

Wavelet transforms have the following properties. 
First, they are localized. This means that the basis func- 
tions have compact support in both spatial and frequency 
domains, and as a result, so do the wavelet coefficients. In 
our case, a wavelet coefficient captures the shape proper- 
ties of the surface at a particular location and scale. Sec- 
ond, they are decorrelating. This means that the wavelet 
coefficients represent approximately independent compo- 
nents of the signal projected into the wavelet basis. In the 
lifting scheme, this is achieved by choosing maximally 
correlated subsets, i.e. adjacent samples, and using one 
to predict the other and taking the residual. This residual, 
or detail coefficient, is independent of the other subset ac- 
cording the prediction model used by V. Third, orthog- 
onal, biorthogonal and other critically sampled wavelets 
are computed in linear time with respect to the number of 
samples. This comes at the loss of translation invariance, 
in contrast to overcomplete or redundant wavelets l27l . 

5.2 Localized PCA 

Performing PCA over the whole set of wavelet coeffi- 
cients would result in the same principal components as 
the global model, because the wavelet transform is a lin- 
ear transform, and PCA essentially just rotates the data to 
align with the directions of greatest variations. Instead, 
we perform PCA locally on each coefficient, which is a 
3D vector quantity, over the database. 

First, let us denote the mean of each wavelet coefficient 
over the database 

1 T 

i=l 

where k indexes the coefficients. 

While we can perform statistical analysis on each s k in- 
dependently of other values of k, we must consider their 
three components together. Each s k is a 3D vector repre- 
senting either the scale (absolute value) or the detail (rela- 
tive value) of the shape at a particular frequency and spa- 
tial location. However, the coordinate axes in general do 
not correspond to the directions of greatest variation in 
the database. Therefore, we perform PCA on each set of 
coefficient vectors, to obtain 3D vectors that represent 
the position along the directions of greatest variation, and 



3x3 matrices U k that transform these coordinates to our 
original world coordinate system, as in 

s k = s k + U k r k (9) 

where we write s k = [x k , y k , z k ) T and r k = [x k , y k , z k ) T 
to denote the components of these vectors. Applying the 
transform (U k ) T to the data diagonalizes the covariance 
matrix, thus making each component independent. 

The reconstruction of the face shape from the model is 
then given by the wavelet transform 

f( S ),= e *o«s o + E E (10) 

ogv(o) j=o iew(j) 

where i is the vertex index in the reconstructed surface, 
j is the level of wavelet coefficient, J is the number of 
levels used, V(0) is the set of scaling functions at level 
zero, W(j) is the set of wavelet functions at level j, o 
and I are the coefficient indices, 0g is the scaling func- 
tion at the coarsest level centered on location o, and ipj is 
the wavelet function at level j and location I. While the 
transform is expressed here in terms of basis functions, 
it is computed using lifting operators, which amount to 
weighted averages of a vertex's local neighborhood, and 
it can be expressed as a matrix multiplication. The basis 
functions themselves, <j>j and ipj, are B-spline approxima- 
tions to Gaussian and Mexican-hat functions. For more 
details, see Bertram et al. Q. 

To use a wavelet basis to represent shape, it must be 
a subdivision surface, in our case Catmull-Clark subdivi- 
sion hierarchy. For training, the surfaces in the database 
are typically stored as triangle meshes without subdivi- 
sion structure. Thus, we must resample the surfaces with 
the proper structure. The subdivision scheme uses quadri- 
lateral elements, although it can handle extraordinary ver- 
tices. We resample the triangle meshes using a custom, 
yet straightforward, technique tailored to the fact that we 
are dealing with faces, which are topologically like a disc. 
In the future, it would be interesting to explore more gen- 
eral methods. 

We empirically select a plane that is aligned with the 
front of the face, and stereographically project the trian- 
gles of the template mesh onto that plane. The surface is 
then resampled in a regular grid. This corresponds to re- 
sampling the surface at bary centric coordinates, and hence 
preserves correspondence. These same coordinates from 
the template mesh are used to resample each mesh in the 
training set. To ensure subdivision hierarchy, the grid is 
generated by starting with a low resolution base grid and 
subdividing it several times to get the full resolution grid. 
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5.3 Properties 



5.4 Energy Minimization 



The local model has the benefit that it avoids overfitting, 
and as a consequence we can keep all variability present 
in the training set. This is because the local surface prop- 
erties of any given surface point are not likely to be spe- 
cific to one set of faces or another. Whereas for the global 
model a bias in the training set, over-representation of one 
sex or a particular ethnicity or age range, can cause the 
lesser principal components to be highly specialized to 
that set, the geometry of a local surface patch is likely 
to be less dependent on the training data. 

Figure [5] shows the mean shape color-coded with the 
magnitude of the shape variability for six levels of the 
wavelet subdivision. This shows the localized shape vari- 
ations at different scales that are captured by this multi- 
resolution approach. 

The dimensionality of the local model is statistically 
more favorable. If, as is usually the case, the number of 
vertices is much greater than the number of training exam- 
ples, n ^> T, then the global model has problems of fitting 
to the particularities of the training set. In the local model, 
many independent statistical priors are learned, each with 
dimension 3. We have many more training examples than 
that. The independence of the local priors further allows 
an exhaustive search of the parameter space. Thus, we 
have no danger of getting trapped in local minima. 

The drawback of these properties, in particular of re- 
taining all the variability of the training data, is that the 
local model is a much higher-dimensional representation 
than the global model. Thus, the dimensionality of the lo- 
cal model is computationally much less favorable. There 
is, however, a trade-off that can be made by fitting the 
wavelet coefficients only up to a certain level, providing a 
less-detailed reconstruction in less time. 

As the lifting operations of the wavelet transform 
amount to local weighted averages of vertex coordinates, 
the transform can be expressed as a matrix multiplication, 
if the surface is expressed as a vector containing the vertex 
coordinates. Because the transform is biorthogonal, this 
matrix is square and has full rank. In contrast, the global 
PCA in Section [4j when expressed as a matrix whose 
columns are the principal components V*, i = 0, . . . , d, 
has rank d < min(3n — 1, T — 1). As discussed in Section 
[4j resampling the surface at linear combinations of vertex 
coordinates (eg., within a triangle), does not increase the 
rank of the transform. However, because the rank of the 
wavelet transform is determined by the number of ver- 
tices, we can obtain more detail by linearly upsampling 
the training surfaces. This means that we can resample 
the training shapes at high resolution to obtain a statistical 
model that captures fine shape detail. 



We minimize the energy ([2]) using a global search of each 
parameter. That is, we sample uniformly within the range 
given by the hyper-box prior Q for each component of 
for each k sequentially, starting with the coarsest resolu- 
tion coefficients and progressively increasing the resolu- 
tion. For each sampled value, we reconstruct the surface 
using J9]) and ( [TO] ), and evaluate the nearest neighbor en- 
ergy ([2JT The parameter value that minimizes this energy 
is then taken as the estimate for this parameter. 

5.5 Computational Complexity 

Since we use a sampling approach to minimize the energy 
for the local model, our complexity depends not on the 
number of iterations of a nonlinear optimization, but on 
the number of samples per parameter. Here, we de- 
note the number of wavelet coefficients by n. Note, how- 
ever, that n increases with increasing J since all training 
surfaces are resampled. In our experiments, we resam- 
ple all training shapes with n = 24897 vertices. The 
complexity of the nearest neighbor search remains un- 
changed, as does the cost of evaluating the energy. How- 
ever, the energy must be evaluated ^ times for each 
of the n coefficients. Hence, our overall complexity is 
0(m log m + n(m 2 / 3 +tLfi)), which is dominated by the 
0(n 2 tL) part. Thus, assuming n ^> to, we expect the 
local model, with its higher-dimensional representation to 
take much longer to fit to the same point cloud. However, 
a trade-off between detail and running time can be made 
by fitting coefficients only up to some level less than the 
number of levels in the wavelet decomposition. 

6 Comparative Evaluation 

In this section we evaluate both the fitting speed and qual- 
ity of the global and local models. 

6.1 Experimental Setup 

Implementation Details In all experiments, we set the 
truncation threshold r to 10mm, and we set the parameter 
c controlling the size of the hyper-box prior to 1.0. For 
the global model, we use 50 principal components. For 
the local model, we use a base grid of size 5x7, and we 
use at most J = 6 levels of subdivision. 

Test Database We use a subset of 20 subjects (10 female 
and 10 male) of the Bosphorus database to test our al- 
gorithm. Each subject is present in five occlusion levels: 
without occlusion, with glasses, with an occlusion of one 
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level 



level 1 



level 2 



level 3 



level 4 



level 5 



Figure 5: Mean shape color-coded with the magnitude of shape variability for different levels. All units are in meters. 



eye by a hand, with an occlusion of the mouth by a hand, 
and with an occlusion of parts of the face by hair. Exam- 
ples for each occlusion class can be seen in the left column 
of Figure [8] 

Timing The global model can be fit in under a minute per 
face for the data we are using. Depending on the num- 
ber of levels of coefficients that are optimized, the local 
model takes between about 10 minutes to over an hour 
to fit. Figure [6] shows the reconstruction of a Bosphorus 
model when optimizing the shape coefficients of the lo- 
cal shape space up to different levels. Note that the more 
levels are used, the more accurate the reconstruction be- 
comes. However, while the reconstruction up to level zero 
runs in under one minute, the reconstruction up to level 
five runs in slightly over one hour. This gives a way to 
trade off computation time and reconstruction accuracy. 
For all the experiments in the following, we evaluate the 
accurate reconstructions up to level 5. 

6.2 Landmark Distance 

We first evaluate the fitting quality using a subset of the 
landmarks (the ones shown in green in Figure [T]) located 
on the input data. Note that these landmarks are not used 
in the initial alignment. Also recall that not all of these 
landmarks are present in all target scans. We only evaluate 
the error for those landmarks present in a given scan. The 
landmarks that are present in the test data are considered 
the ground truth landmark locations. We manually placed 
all of the landmarks used for testing on the mean shape of 
the aligned training data. The position of these landmarks 
after fitting are the estimated landmark positions. The dis- 
tance between these estimates and the available ground 
truth landmark positions is the error we measure for the 
test data under different types of occlusion. 

Table Q] shows the statistics of the distances from the 
landmarks on the fitted model to those on the input data. 
Note that the local method produces lower mean and me- 
dian errors for all types of occlusion. This reflects the fact 
that we can keep all variability in the training data with- 
out the risk of overfitting. Thus, the local model better 



captures local shape detail. We see that in three cases (no 
occlusion, subject wearing glasses, occlusion of the eye), 
the local method produces higher maximum errors. This 
reflects the fact that each part of the surface is locally fit- 
ted, and that if the nearest neighbors in that region are 
beyond the threshold r, they do not pull sufficiently to- 
wards the surface. The global method's ability to capture 
the overall shape allows these points to get closer despite 
the lack of pull in that particular area. 

This evaluation is commonly considered an accurate 
form of error measurement. However, unfortunately, it 
is possible only for a small subset of surface points since 
we require ground truth landmarks for this test. In the 
following, we give a less accurate, yet more dense evalu- 
ation, using a shape distance. 

6.3 Surface Distance 

We now evaluate the distance between the input scan and 
the fitted model over the entire surface by computing the 
distance to the nearest neighbor on the input data for each 
point on the fitted model. Table [2] shows the nearest 
neighbor distance over the surfaces. This gives a lower 
bound on the fitting error in terms of semantically mean- 
ingful correspondences, but it can be computed for the en- 
tire surface. We see that the local model produces lower 
mean and median errors, and higher standard deviation 
and maximum error. This reflects the fact that there are a 
relatively few points where there is not enough pull from 
the energy function to the data, and these points are not 
fitted well. In areas of the surface where the data is close 
enough to the initial alignment, however, the local model 
better fits to the surface due to the retained variability in 
the model. Conversely, the global model does not get as 
close to any points, but the global information allows the 
overall shape to guide it in areas where the initial align- 
ment is not close enough to the data. 

Figure [7] shows the color coded average distances to the 
nearest neighbors for all points on the surface. For the re- 
sults without occlusion, in most regions of the face, the 
local shape space yields results that are closer to the input 
surface. However, at the nose tip and the chin, the global 
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level level 1 level 2 level 3 level 4 level 5 input data 

Figure 6: Reconstruction of a model from the Bosphorus database using different levels of the local shape space. 



Model 


Occlusion 


Mean 


Median 


Std. Dev. 


Max 




none 


6.99 


5.56 


4.94 


28.36 


global 


glasses 


8.06 


6.81 


5.77 


31.73 


eye 


7.66 


5.82 


5.48 


33.25 




mouth 


9.36 


6.97 


7.27 


38.19 




hair 


7.95 


6.18 


5.49 


37.69 




none 


6.16 


5.27 


4.40 


31.43 


local 


glasses 


7.43 


5.56 


5.97 


33.30 


eye 


6.67 


5.42 


4.95 


34.63 




mouth 


7.66 


6.45 


5.06 


33.53 




hair 


6.70 


5.44 


4.90 


34.62 



Table 1: Landmark distance statistics for global and local models. All measurements are in millimeters. 



model is closer to the input surface than the local model. 
The reason is that the initialization is often poor in these 
regions and that as a result the local model does not fit this 
area to the surface. For the models with occlusion, the ad- 
ditional error caused by the occlusion is generally more 
localized when using the local shape space than when us- 
ing the global shape space. This is especially visible in 
the region around the left eye for the examples where the 
right eye is occluded by a hand (third row of Figure [7]). 
For the local shape space, the region around the left eye 
has low average fitting error, while for the global shape 
space, this region has larger average fitting error because 
it is influenced by errors in the (symmetric) region around 
the right eye. 

6.4 Visual Qualitative Evaluation 

Figure [8] shows some examples of the fitted models for 
visual evaluation. Both models fit the shape model close 
to the input data for all of the examples. Note that overall, 
the results of the local method capture more shape detail 
than the results of the global method and that in most areas 
of the face, the results of the local method are fitted closer 
to the data than the results of the global method. A notable 
exception is the nose area of the subject shown in the last 
row of the figure. The reason is that the initialization is 
poor in this region, which is discussed in detail above. 



The third row of Figure [5] shows a facial expression that 
is asymmetric in the cheek area. The output of the global 
method is a fairly symmetric face since the global shape 
prior learned the symmetric structure of the face. The out- 
put of the local method correctly captures the asymmetry 
in the reconstruction since the local shape prior allows for 
more flexibility in localized shape differences. 

Figure [9] shows two results obtained using noisy and in- 
complete stereo and range data. The 3D stereo data used 
as input to our comparison is obtained using the approach 
by Brunton et al. (8 ] from two input images. The resulting 
point cloud has missing data, which is typical for data ob- 
tained using passive stereo approaches. For this dataset, 
both models fit the shape to the input data well. As in pre- 
vious experiments, the result using the global model con- 
tains less detail than the result using the local model. The 
range data used as input to our comparison is obtained 
using a Kinect sensor. This dataset has low resolution, 
missing data, and significant data noise. In spite of these 
problems, both models fit the shape close to the data in 
most areas. For this example, the result using the global 
model is better in the chin region than the result using the 
local model. The reason is that due to poor initial align- 
ment, the overall shape of the chin region of the result 
computed using the local model is far from the input sur- 
face. However, in the nose region, where the initial align- 
ment is good, the result using the local model is closer to 
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input global model local model 



Figure 8: Some fitting results. Each row shows from left to right: input data, result of global fitting, color coding of 
distances between global fitting result and input data, result of local fitting, color coding of distances between local 
fitting result and input data, and the legend for the color coding. 
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Model 


Occlusion 


Mean 


Median 


Std. Dev. 


Max 




none 


0.94 


0.75 


(\ i-ir\ 

0.70 


lz.04 


global 


glasses 


1.10 


0.88 


0.79 


9.01 


eye 


2.34 


1.22 


3.88 


67.37 




mouth 


3.14 


1.53 


4.59 


49.03 




hair 


Z.^t / 






J7.0U 




none 


0.82 


0.54 


0.93 


15.53 


local 


glasses 


0.93 


0.61 


1.02 


15.70 


eye 


1.96 


0.76 


4.08 


68.37 




mouth 


2.61 


0.93 


4.53 


52.69 




hair 


1.96 


0.75 


4.02 


45.13 



Table 2: Nearest neighbor distance statistics for global and local models. All measurements are in millimeters. 



stereo data 



Kinect data 




input 



global model 



local model 



Figure 9: Fitting to noisy data. Top row from left to right: left input image of stereo data, stereo point cloud, result of 
global fitting, and result of local fitting. Bottom row from left to right: input Kinect frame, result of global fitting, and 
result of local fitting. 



the input data than the result using the global model. 



7 Conclusion 



6.5 Limitations 

Both methods are limited to datasets in which most of 
the surface of interest is not subject to occlusion. A case 
where the fitting results of both methods are not satisfac- 



tory is shown in Figure 10 Here, a large part of the input 
face is occluded by hair. As a result, the result of the 
global method is a plausible human face that is visually 
far from the input face. The result of the local method can 
achieve a visually better fitting result in areas of the face 
that are not subject to occlusion, such as the left side of 
the face. However, in occluded regions, the result suffers 
from the same drawbacks as the result obtained using the 
global method. 



In this paper we have performed a comparative analysis, 
both theoretically and experimentally, of global and lo- 
cal statistical shape models. We have found the following 
differences between the two types of models: Local mod- 
els capture details better at the cost of greater computa- 
tional requirements. This is in part due to the optimization 
strategy used in this investigation (a sampling strategy that 
avoids local minima), but is also due to the much higher 
dimensionality of the local model. The global model has 
much lower dimensionality and can thus be fitted to input 
data much faster. In some cases, the global model better 
captures the overall shape, height and width, of the face. 
The local model avoids overfitting, because local surface 
patches are not likely to be biased for a particular database 
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Distance [mm] 




Figure 7: Histograms and false color visualizations of the 
mean nearest neighbor distances for different types of oc- 
clusion. 




input global model local model 



Figure 10: The fitting results are not satisfactory in cases 
with extreme occlusion. 

the way the shape of the entire face can be. Put another 
way, we can reasonably assume that local surface patches 
from human faces have much lower variation than do en- 
tire faces, hence a limited training set has a better chance 
of capturing the full variability for the local model. The 
local model also better contains erroneous reconstruction 
due to occlusion to the affected areas, whereas the global 
model typically captures approximately symmetric shapes 
of human faces. Thus, an occlusion of the left side will 
cause poor fitting on the right as well. The wavelet trans- 
form has rank equal to the resolution with which the sur- 
face is sampled, whereas the rank of the global PCA trans- 
form is limited by the number of training samples. Hence, 
the local method can capture additional details by subdi- 
vision resampling. 

In this paper, we have used a bare-bones fitting en- 
ergy to illuminate the differences between the two types of 
models. More terms could be added to this energy, such as 
smoothing terms, or landmark terms. We have also used 
manually placed landmarks to avoid landmark misplace- 
ment as a source of error. Automatic landmark location 
and initial alignment are clear avenues for future work. 
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